Skip to content

[LEADS-348] evaluation latency#235

Merged
asamal4 merged 1 commit into
lightspeed-core:mainfrom
xmican10:LEADS-348-evaluation-latency
May 26, 2026
Merged

[LEADS-348] evaluation latency#235
asamal4 merged 1 commit into
lightspeed-core:mainfrom
xmican10:LEADS-348-evaluation-latency

Conversation

@xmican10

@xmican10 xmican10 commented May 14, 2026

Copy link
Copy Markdown
Collaborator

Description

Depends on #230
Further changes for LEADS-348:

  • result.execution_time was renamed to result.evaluation_latency which represents better that it is measuring the evaluation
  • rounding of result.evaluation_latency was removed for consistency, since other timing metrics are not being rounded
  • result.execution_time was reintroduced (for backward compatibility), but currently measures the whole api execution and evaluation per turn
  • test_evaluation.py was refactored - redundant mocking was moved to pytest fixture (~500 lines reduced)

Type of change

  • Refactor
  • New feature
  • Bug fix
  • CVE fix
  • Optimization
  • Documentation Update
  • Configuration Update
  • Bump-up service version
  • Bump-up dependent library
  • Bump-up library or tool used for development (does not change the final image)
  • CI configuration change
  • Unit tests improvement

Tools used to create PR

Identify any AI code assistants used in this PR (for transparency and review context)

  • Assisted-by: Claude (e.g., Claude, CodeRabbit, Ollama, etc., N/A if not used)
  • Generated by: (e.g., tool name and version; N/A if not used)

Related Tickets & Documents

  • Related Issue #
  • Closes #

Checklist before requesting a review

  • I have performed a self-review of my code.
  • PR has passed all pre-merge test jobs.
  • If it is a core feature, I have added thorough tests.

Testing

  • Please provide detailed steps to perform tests related to this code change.
  • How were the fix/results from this change verified? Please provide relevant screenshots or results.

Summary by CodeRabbit

  • New Features

    • Added a separate evaluation latency metric and now reports execution time as the sum of evaluation + agent latencies; the new field appears in CSV exports and JSON outputs.
  • Tests

    • Updated and expanded tests and fixtures to validate evaluation latency capture, execution-time computation, and CSV/DB output consistency.

Review Change Stack

@coderabbitai

coderabbitai Bot commented May 14, 2026

Copy link
Copy Markdown
Contributor

Walkthrough

This PR refactors evaluation result timing from a single execution_time field to a decomposed model where evaluation_latency and agent_latency are measured separately and execution_time equals their sum. The change spans data model, evaluator logic, storage, output, and comprehensive test updates.

Changes

Timing Model Refactoring

Layer / File(s) Summary
Schema and configuration updates
config/system.yaml, src/lightspeed_evaluation/core/constants.py, src/lightspeed_evaluation/core/models/data.py
Configuration and data model fields updated to support evaluation_latency as a CSV column and field; timing fields now use float defaults with clarified descriptions.
Evaluator timing measurement and computation
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
Added _measure_latency() using time.perf_counter(). evaluate_metric() captures start_time, computes evaluation_latency, records agent_latency, and sets execution_time = evaluation_latency + agent_latency. Error paths now pass start_time to _create_error_result().
Storage and output serialization
src/lightspeed_evaluation/core/storage/sql_storage.py, src/lightspeed_evaluation/core/output/generator.py
ORM model adds nullable evaluation_latency column and mapper persists it. JSON/CSV output generation now includes evaluation_latency; per-column formatting for execution_time was simplified.
Test fixtures and system config test
tests/script/conftest.py, tests/unit/pipeline/evaluation/conftest.py, tests/unit/core/system/test_loader.py
Script fixtures updated to use evaluation_latency. New evaluator pytest fixture wires a MetricsEvaluator with mocked dependencies. Added test to load and validate default config/system.yaml.
Data model, storage and output tests
tests/unit/core/models/test_data.py, tests/unit/core/models/test_summary.py, tests/unit/core/output/test_generator.py, tests/unit/core/storage/test_sql_storage.py
Unit tests updated to assert defaults/validation for evaluation_latency, test fixtures adjusted, and storage/output tests check evaluation_latency persistence and CSV/DB column presence.
Evaluator behavior and error tests
tests/unit/pipeline/evaluation/test_evaluator.py, tests/unit/pipeline/evaluation/test_errors.py
Evaluator test suite refactored to use injected evaluator fixture; added/updated assertions for evaluation_latency, agent_latency, execution_time computation, error-result field preservation, expected-response handling, and token accounting (no double counting) across multi-path and error scenarios.

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested reviewers

  • asamal4
  • VladimirKadlec
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title '[LEADS-348] evaluation latency' clearly describes the main objective: introducing evaluation latency tracking and measurement. It directly matches the primary changes across all modified files.
Docstring Coverage ✅ Passed Docstring coverage is 96.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@xmican10 xmican10 force-pushed the LEADS-348-evaluation-latency branch from 3eed19a to 14e437e Compare May 15, 2026 08:14
@xmican10 xmican10 changed the title Leads 348 evaluation latency [LEADS-348] evaluation latency May 15, 2026
@xmican10 xmican10 force-pushed the LEADS-348-evaluation-latency branch 2 times, most recently from 8eb287c to e9c4b17 Compare May 20, 2026 07:42
@xmican10 xmican10 marked this pull request as ready for review May 20, 2026 07:44

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lightspeed_evaluation/core/output/generator.py`:
- Line 817: The JSON summary currently adds "evaluation_latency":
r.evaluation_latency but removed the older "execution_time" field; restore
compatibility by including "execution_time" in the emitted dict (e.g., set
"execution_time": r.execution_time or, if r has no execution_time, set it to the
same value as r.evaluation_latency) alongside "evaluation_latency" in the output
constructed where the dict containing "evaluation_latency" is created so older
consumers still receive execution_time.

In `@src/lightspeed_evaluation/core/storage/sql_storage.py`:
- Line 60: The SQL schema and persistence code currently define/persist only
evaluation_latency, dropping the previously stored execution_time; add back
execution_time = Column(Float, nullable=True) alongside evaluation_latency in
the model definition and update the persistence logic that currently writes
evaluation_latency (e.g., in the function that inserts/updates evaluation
records—look for code referencing evaluation_latency around the save/insert
method) to also set execution_time from the same source value so existing
consumers continue receiving execution_time.

In `@tests/unit/pipeline/evaluation/test_errors.py`:
- Line 160: The test currently asserts results[0].evaluation_latency but misses
verifying the backward-compatible execution_time field on error results; update
the test in tests/unit/pipeline/evaluation/test_errors.py to also assert that
results[0].execution_time equals 0.0 alongside the existing evaluation_latency
assertion so error results explicitly include the legacy execution_time field.

In `@tests/unit/pipeline/evaluation/test_evaluator.py`:
- Around line 583-605: The test
test_execution_time_conversation_level_averages_agent_latency incorrectly
asserts result.agent_latency == 4.0; update the assertion to expect the average
agent latency of the conversation (2.0) since TurnData instances have
agent_latency 1.0 and 3.0 and EvaluationRequest.for_conversation should compute
the mean; change the assertion referencing result.agent_latency in this test to
assert result.agent_latency == 2.0 and keep the other
execution_time/evaluation_latency checks unchanged.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 56132854-62fe-4236-8dc5-3380d17e64a8

📥 Commits

Reviewing files that changed from the base of the PR and between 377a66b and e9c4b17.

📒 Files selected for processing (15)
  • config/system.yaml
  • src/lightspeed_evaluation/core/constants.py
  • src/lightspeed_evaluation/core/models/data.py
  • src/lightspeed_evaluation/core/output/generator.py
  • src/lightspeed_evaluation/core/storage/sql_storage.py
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
  • tests/script/conftest.py
  • tests/unit/core/models/test_data.py
  • tests/unit/core/models/test_summary.py
  • tests/unit/core/output/test_generator.py
  • tests/unit/core/storage/test_sql_storage.py
  • tests/unit/core/system/test_loader.py
  • tests/unit/pipeline/evaluation/conftest.py
  • tests/unit/pipeline/evaluation/test_errors.py
  • tests/unit/pipeline/evaluation/test_evaluator.py

Comment thread src/lightspeed_evaluation/core/output/generator.py
Comment thread src/lightspeed_evaluation/core/storage/sql_storage.py
Comment thread tests/unit/pipeline/evaluation/test_errors.py
Comment thread tests/unit/pipeline/evaluation/test_evaluator.py Outdated
@xmican10 xmican10 force-pushed the LEADS-348-evaluation-latency branch from e9c4b17 to 457af2f Compare May 20, 2026 08:12

@asamal4 asamal4 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. A minor typo !!

Comment thread tests/unit/pipeline/evaluation/test_evaluator.py Outdated
@xmican10 xmican10 force-pushed the LEADS-348-evaluation-latency branch from 457af2f to 39af396 Compare May 25, 2026 11:13

@VladimirKadlec VladimirKadlec left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍

@xmican10 xmican10 force-pushed the LEADS-348-evaluation-latency branch from 39af396 to 6d2fd1d Compare May 26, 2026 07:57

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/unit/core/storage/test_sql_storage.py (1)

269-314: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add assertion for execution_time in comprehensive storage test.

The test test_all_fields_stored verifies that all EvaluationResult fields are persisted correctly. It sets evaluation_latency=2.5 (line 283) and asserts row.evaluation_latency == 2.5 (line 312), but doesn't verify execution_time is stored correctly.

Per lines 417-428, both evaluation_latency and execution_time are DB columns. This test should assert row.execution_time to ensure backward compatibility and verify the relationship between timing fields is preserved through storage.

🔍 Suggested assertion to add
         assert row is not None
         assert row.conversation_group_id == "conv_full"
         assert row.score == 0.92
         assert row.evaluation_latency == 2.5
+        assert row.execution_time is not None  # Verify backward compat field is stored
         assert row.tool_calls == '[{"name": "search"}]'

If execution_time should equal evaluation_latency + agent_latency, verify that relationship:

# Assuming agent_latency defaults to 0 when not set
assert row.execution_time == 2.5  # or appropriate computed value
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/core/storage/test_sql_storage.py` around lines 269 - 314, The
test_all_fields_stored test creates an EvaluationResult with
evaluation_latency=2.5 but never asserts the execution_time DB column; update
the test to read row.execution_time and assert it equals the expected value
(likely 2.5 if agent latency defaults to 0 or evaluation_latency + agent_latency
if agent_latency is present) after saving via
SQLStorageBackend.initialize/save_result; reference the EvaluationResult
instance, the evaluation_latency field, and the execution_time DB column to
ensure the stored execution_time is validated.
🧹 Nitpick comments (3)
tests/unit/core/models/test_summary.py (1)

29-29: ⚡ Quick win

Consider including execution_time in test fixture for completeness.

The fixture now sets evaluation_latency: 1.0 but doesn't set execution_time. Per the PR summary, execution_time is retained for backward compatibility and should equal evaluation_latency + agent_latency. Consider explicitly setting execution_time in the fixture to create realistic test data and verify the field is properly handled throughout the summary generation logic.

♻️ Suggested addition to fixture
 _RESULT_DEFAULTS: dict[str, Any] = {
     "conversation_group_id": "conv1",
     "tag": "eval",
     "turn_id": "turn1",
     "metric_identifier": "ragas:faithfulness",
     "result": "PASS",
     "score": 0.85,
     "threshold": 0.7,
     "reason": "Good",
     "evaluation_latency": 1.0,
+    "agent_latency": 0.5,
+    "execution_time": 1.5,
     "judge_llm_input_tokens": 100,
     "judge_llm_output_tokens": 50,
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/core/models/test_summary.py` at line 29, The test fixture in
tests/unit/core/models/test_summary.py sets "evaluation_latency" but omits
"execution_time"; update the fixture to include "execution_time" and set it
equal to evaluation_latency + agent_latency (so execution_time =
evaluation_latency + agent_latency) to match the PR behavior and ensure summary
generation and backward-compat checks (e.g., in any assertions referencing
execution_time) use realistic data.
tests/unit/core/storage/test_sql_storage.py (1)

53-53: ⚡ Quick win

Consider setting execution_time in test fixture for backward compatibility verification.

The fixture sets evaluation_latency=1.5 but doesn't set execution_time or agent_latency. Per the PR summary and line 428 of this file, execution_time is retained as a DB column for backward compatibility. The test fixture should either:

  1. Explicitly set execution_time to verify it's stored correctly, or
  2. Verify the model auto-computes execution_time = evaluation_latency + agent_latency

Without this, tests may not catch issues where execution_time is NULL or incorrectly computed.

♻️ Suggested fixture enhancement
     return EvaluationResult(
         conversation_group_id="conv_001",
         tag="test",
         turn_id="turn_1",
         metric_identifier="ragas:faithfulness",
         metric_metadata='{"threshold": 0.8}',
         result="PASS",
         score=0.95,
         threshold=0.8,
         reason="Good response",
         query="What is Python?",
         response="Python is a programming language.",
         evaluation_latency=1.5,
+        agent_latency=0.0,
+        execution_time=1.5,
         api_input_tokens=100,
         api_output_tokens=50,
     )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/core/storage/test_sql_storage.py` at line 53, The test fixture in
tests/unit/core/storage/test_sql_storage.py currently sets
evaluation_latency=1.5 but omits execution_time and agent_latency; update the
fixture (the evaluation row/setup used in the tests) to either explicitly set
execution_time (e.g., execution_time=<expected value>) to verify DB persistence,
or set agent_latency and then add an assertion in the relevant test that
execution_time == evaluation_latency + agent_latency so the code path that
computes/stores execution_time is exercised; refer to the fields execution_time,
evaluation_latency, and agent_latency when making the change.
tests/unit/core/models/test_data.py (1)

472-472: ⚡ Quick win

Add an execution_time default assertion to the test
EvaluationResult.execution_time is defined with default=0.0 alongside evaluation_latency=0.0; extend tests/unit/core/models/test_data.py (line 472) to assert result.execution_time == 0.0 in addition to result.evaluation_latency == 0.0.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/core/models/test_data.py` at line 472, The test currently asserts
result.evaluation_latency == 0 but misses the sibling default; add an assertion
for EvaluationResult.execution_time by verifying result.execution_time == 0.0
immediately alongside the existing evaluation_latency check (i.e., update the
test that references result to include result.execution_time == 0.0).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@tests/unit/core/storage/test_sql_storage.py`:
- Around line 269-314: The test_all_fields_stored test creates an
EvaluationResult with evaluation_latency=2.5 but never asserts the
execution_time DB column; update the test to read row.execution_time and assert
it equals the expected value (likely 2.5 if agent latency defaults to 0 or
evaluation_latency + agent_latency if agent_latency is present) after saving via
SQLStorageBackend.initialize/save_result; reference the EvaluationResult
instance, the evaluation_latency field, and the execution_time DB column to
ensure the stored execution_time is validated.

---

Nitpick comments:
In `@tests/unit/core/models/test_data.py`:
- Line 472: The test currently asserts result.evaluation_latency == 0 but misses
the sibling default; add an assertion for EvaluationResult.execution_time by
verifying result.execution_time == 0.0 immediately alongside the existing
evaluation_latency check (i.e., update the test that references result to
include result.execution_time == 0.0).

In `@tests/unit/core/models/test_summary.py`:
- Line 29: The test fixture in tests/unit/core/models/test_summary.py sets
"evaluation_latency" but omits "execution_time"; update the fixture to include
"execution_time" and set it equal to evaluation_latency + agent_latency (so
execution_time = evaluation_latency + agent_latency) to match the PR behavior
and ensure summary generation and backward-compat checks (e.g., in any
assertions referencing execution_time) use realistic data.

In `@tests/unit/core/storage/test_sql_storage.py`:
- Line 53: The test fixture in tests/unit/core/storage/test_sql_storage.py
currently sets evaluation_latency=1.5 but omits execution_time and
agent_latency; update the fixture (the evaluation row/setup used in the tests)
to either explicitly set execution_time (e.g., execution_time=<expected value>)
to verify DB persistence, or set agent_latency and then add an assertion in the
relevant test that execution_time == evaluation_latency + agent_latency so the
code path that computes/stores execution_time is exercised; refer to the fields
execution_time, evaluation_latency, and agent_latency when making the change.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: be024615-6acf-4965-a895-19197c932bc2

📥 Commits

Reviewing files that changed from the base of the PR and between e9c4b17 and 6d2fd1d.

📒 Files selected for processing (15)
  • config/system.yaml
  • src/lightspeed_evaluation/core/constants.py
  • src/lightspeed_evaluation/core/models/data.py
  • src/lightspeed_evaluation/core/output/generator.py
  • src/lightspeed_evaluation/core/storage/sql_storage.py
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
  • tests/script/conftest.py
  • tests/unit/core/models/test_data.py
  • tests/unit/core/models/test_summary.py
  • tests/unit/core/output/test_generator.py
  • tests/unit/core/storage/test_sql_storage.py
  • tests/unit/core/system/test_loader.py
  • tests/unit/pipeline/evaluation/conftest.py
  • tests/unit/pipeline/evaluation/test_errors.py
  • tests/unit/pipeline/evaluation/test_evaluator.py
✅ Files skipped from review due to trivial changes (1)
  • src/lightspeed_evaluation/core/constants.py

@asamal4 asamal4 merged commit 59240a0 into lightspeed-core:main May 26, 2026
16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants